Title

Site Reliability Engineer (SRE)

Description

We are looking for a Site Reliability Engineer to join our growing technology team. As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our systems and services. You will work closely with software engineers, system administrators, and other stakeholders to build and maintain scalable infrastructure, automate operational tasks, and respond to incidents effectively.

The ideal candidate will have a strong background in systems engineering, cloud infrastructure, and software development. You will be expected to design and implement monitoring solutions, develop tools to improve system reliability, and participate in on-call rotations to address production issues. Your work will directly impact the user experience by minimizing downtime and ensuring high availability of our services.

As an SRE, you will also be responsible for conducting post-incident reviews, identifying root causes, and implementing long-term solutions to prevent recurrence. You will champion best practices in system design, deployment, and maintenance, and help foster a culture of reliability and continuous improvement across the organization.

This role requires excellent problem-solving skills, a proactive mindset, and the ability to work in a fast-paced, collaborative environment. If you are passionate about building reliable systems and enjoy working at the intersection of software and operations, we encourage you to apply.

Responsibilities

  • Design, build, and maintain scalable and reliable infrastructure
  • Develop and implement monitoring and alerting systems
  • Automate operational tasks and improve system efficiency
  • Participate in on-call rotations and respond to incidents
  • Conduct root cause analysis and post-incident reviews
  • Collaborate with development teams to improve system architecture
  • Ensure high availability and performance of services
  • Implement security best practices and compliance standards
  • Manage cloud infrastructure and deployment pipelines
  • Continuously improve system reliability and operational processes

Requirements

  • Bachelor’s degree in Computer Science or a related field
  • 3+ years of experience in Site Reliability Engineering or DevOps
  • Strong knowledge of Linux/Unix systems
  • Experience with cloud platforms such as AWS, GCP, or Azure
  • Proficiency in scripting or programming languages such as Python, Bash, or Go
  • Familiarity with CI/CD tools and practices
  • Experience with monitoring tools like Prometheus, Grafana, or Datadog
  • Understanding of networking, security, and system architecture
  • Excellent troubleshooting and problem-solving skills
  • Strong communication and collaboration abilities

Potential interview questions

  • What experience do you have with cloud infrastructure?
  • Can you describe a time you resolved a major system outage?
  • What monitoring tools have you used in previous roles?
  • How do you approach automating repetitive tasks?
  • What is your experience with CI/CD pipelines?
  • How do you ensure system security and compliance?
  • Describe your experience with on-call rotations.
  • What scripting languages are you most comfortable with?
  • How do you handle post-incident reviews?
  • What strategies do you use to improve system reliability?